1 Getting ready

First we load the necessary packages.

library(here) # A Simpler Way to Find Your Files, CRAN v1.0.1
library(tidyverse) # Easily Install and Load the 'Tidyverse', CRAN v1.3.0 
library(dada2) # Accurate, high-resolution sample inference from amplicon data, Bioconductor v1.18.0  
library(plyr) # Tools for Splitting, Applying and Combining Data, CRAN v1.8.6
library(DT) # A Wrapper of the JavaScript Library 'DataTables', CRAN v0.17 
library(plotly) # Create Interactive Web Graphics via 'plotly.js', CRAN v4.9.3
library(biomformat) # An interface package for the BIOM file format, Bioconductor v1.18.0

# Set seed
set.seed(1910)

Set the path to the fastq files:

path <- here::here("data/raw/casava-18-paired-end-demultiplexed-run1")
head(list.files(path))
## [1] "AqFl1-002_S1_L001_R1_001.fastq.gz"  "AqFl1-002_S1_L001_R2_001.fastq.gz" 
## [3] "AqFl1-003_S10_L001_R1_001.fastq.gz" "AqFl1-003_S10_L001_R2_001.fastq.gz"
## [5] "AqFl1-005_S19_L001_R1_001.fastq.gz" "AqFl1-005_S19_L001_R2_001.fastq.gz"

Now we read in the names of the fastq files, and perform some string manipulation to get matched lists of the forward and reverse fastq files.

# Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq.gz and SAMPLENAME_R2_001.fastq.gz
fnFs <- sort(list.files(path, pattern = "_R1_001.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern = "_R2_001.fastq.gz", full.names = TRUE))

# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq.gz
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
head(sample.names)
## [1] "AqFl1-002" "AqFl1-003" "AqFl1-005" "AqFl1-007" "AqFl1-009" "AqFl1-011"

2 Inspect read quality

We start by visualizing the quality profiles of the forward reads:

plotQualityProfile(fnFs[1:2])

In gray-scale is a heat map of the frequency of each quality score at each base position. The median quality score at each position is shown by the green line, and the quantiles of the quality score distribution by the orange lines. The red line shows the scaled proportion of reads that extend to at least that position (this is more useful for other sequencing technologies, as Illumina reads are typically all the same length, hence the flat red line).

The forward reads are good quality. We will truncate the forward reads at position 260.

Now we visualize the quality profile of the reverse reads:

plotQualityProfile(fnRs[1:2])

The reverse reads are of significantly worse quality, which is common in Illumina sequencing. This isn’t too worrisome, as DADA2 incorporates quality information into its error model which makes the algorithm robust to lower quality sequence, but trimming as the average qualities crash will improve the algorithm’s sensitivity to rare sequence variants. Based on these profiles, we will truncate the reverse reads at position 188 where the quality distribution crashes.

Considerations: the reads must still overlap after truncation in order to merge them later! For less-overlapping primer sets, like the V1-V2 we used for the present study, the truncLen must be large enough to maintain 20 + biological.length.variation nucleotides of overlap between them.

3 Filter and Trim

Assign filenames for the filtered fastq files.

# Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names

We’ll use standard filtering parameters: maxN = 0 (DADA2 requires no Ns), truncQ = 2, rm.phix = TRUE and maxEE = 2. The maxEE parameter sets the maximum number of “expected errors” allowed in a read, which is a better filter than simply averaging quality scores. We’ll also trim off the primer sequence from the forward and reverse reads by setting trimLeft = c(20, 18). Trimming and filtering is performed on paired reads jointly, i.e. both reads must pass the filter for the pair to pass.

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft = c(20, 18),
                     truncLen = c(260,188), maxN = 0, maxEE = c(2,2), truncQ = 2, 
                     rm.phix = TRUE, compress = TRUE, multithread = TRUE) 
head(out)
##                                    reads.in reads.out
## AqFl1-002_S1_L001_R1_001.fastq.gz    146844     91434
## AqFl1-003_S10_L001_R1_001.fastq.gz   173709    111395
## AqFl1-005_S19_L001_R1_001.fastq.gz   197511    127471
## AqFl1-007_S28_L001_R1_001.fastq.gz   148087     89004
## AqFl1-009_S37_L001_R1_001.fastq.gz   156341     99848
## AqFl1-011_S46_L001_R1_001.fastq.gz   200487    126400

Considerations: The standard filtering parameters are starting points, not set in stone. If speeding up downstream computation is needed, consider tightening maxEE. If too few reads are passing the filter, consider relaxing maxEE, perhaps especially on the reverse reads (eg. maxEE = c(2,5)), and reducing the truncLen to remove low quality tails. Remember though, when choosing truncLen for paired-end reads we must maintain overlap after truncation in order to merge them later.

4 Learn the error rates

The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).

errF <- learnErrors(filtFs, multithread = TRUE)
## 100632960 total bases in 419304 reads from 4 samples will be used for learning the error rates.
errR <- learnErrors(filtRs, multithread = TRUE)
## 109743840 total bases in 645552 reads from 6 samples will be used for learning the error rates.

It is always worthwhile, as a sanity check if nothing else, to visualize the estimated error rates:

plotErrors(errF, nominalQ = TRUE)

plotErrors(errR, nominalQ = TRUE)

The error rates for each possible transition (A→C, A→G, …) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence of the machine-learning algorithm. The red line shows the error rates expected under the nominal definition of the Q-score. Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected. Everything looks reasonable and we proceed with confidence.

5 Sample inference

We are now ready to apply the core sample inference algorithm to the data. By default, the dada function processes each sample independently. However, pooling information across samples can increase sensitivity to sequence variants that may be present at very low frequencies in multiple samples. Here, we set pool = TRUE to retain the singletons for the downstream species richness estimation.

dadaFs <- dada(filtFs, err = errF, pool = FALSE, multithread = TRUE) 
## Sample 1 - 91434 reads in 19535 unique sequences.
## Sample 2 - 111395 reads in 18935 unique sequences.
## Sample 3 - 127471 reads in 17807 unique sequences.
## Sample 4 - 89004 reads in 14665 unique sequences.
## Sample 5 - 99848 reads in 22868 unique sequences.
## Sample 6 - 126400 reads in 23532 unique sequences.
## Sample 7 - 93081 reads in 19859 unique sequences.
## Sample 8 - 104057 reads in 22434 unique sequences.
## Sample 9 - 110017 reads in 15578 unique sequences.
## Sample 10 - 145714 reads in 29189 unique sequences.
## Sample 11 - 121015 reads in 18796 unique sequences.
## Sample 12 - 98960 reads in 18693 unique sequences.
## Sample 13 - 114758 reads in 22850 unique sequences.
## Sample 14 - 160717 reads in 25759 unique sequences.
## Sample 15 - 99795 reads in 17918 unique sequences.
## Sample 16 - 126047 reads in 22290 unique sequences.
## Sample 17 - 86725 reads in 13270 unique sequences.
## Sample 18 - 140566 reads in 27100 unique sequences.
## Sample 19 - 134429 reads in 24959 unique sequences.
## Sample 20 - 91257 reads in 11735 unique sequences.
## Sample 21 - 161003 reads in 25235 unique sequences.
## Sample 22 - 167202 reads in 30632 unique sequences.
## Sample 23 - 152777 reads in 26904 unique sequences.
## Sample 24 - 118126 reads in 22259 unique sequences.
## Sample 25 - 116236 reads in 23578 unique sequences.
## Sample 26 - 188716 reads in 31841 unique sequences.
## Sample 27 - 82010 reads in 15682 unique sequences.
## Sample 28 - 126856 reads in 23821 unique sequences.
## Sample 29 - 88170 reads in 19033 unique sequences.
## Sample 30 - 138640 reads in 26382 unique sequences.
## Sample 31 - 53075 reads in 17030 unique sequences.
## Sample 32 - 96249 reads in 17138 unique sequences.
## Sample 33 - 121832 reads in 21463 unique sequences.
## Sample 34 - 120237 reads in 16232 unique sequences.
## Sample 35 - 101884 reads in 13695 unique sequences.
## Sample 36 - 149692 reads in 26767 unique sequences.
## Sample 37 - 127487 reads in 28577 unique sequences.
## Sample 38 - 127688 reads in 17719 unique sequences.
## Sample 39 - 79140 reads in 14738 unique sequences.
## Sample 40 - 94017 reads in 19750 unique sequences.
## Sample 41 - 133050 reads in 24324 unique sequences.
## Sample 42 - 177248 reads in 25804 unique sequences.
## Sample 43 - 239233 reads in 42943 unique sequences.
## Sample 44 - 84176 reads in 16773 unique sequences.
## Sample 45 - 158908 reads in 24896 unique sequences.
## Sample 46 - 88345 reads in 16241 unique sequences.
## Sample 47 - 83752 reads in 16176 unique sequences.
## Sample 48 - 96472 reads in 8922 unique sequences.
## Sample 49 - 112238 reads in 9047 unique sequences.
## Sample 50 - 133458 reads in 9892 unique sequences.
## Sample 51 - 130657 reads in 11008 unique sequences.
## Sample 52 - 79466 reads in 7201 unique sequences.
## Sample 53 - 155678 reads in 11827 unique sequences.
## Sample 54 - 164385 reads in 13149 unique sequences.
## Sample 55 - 98153 reads in 9239 unique sequences.
## Sample 56 - 125481 reads in 10033 unique sequences.
## Sample 57 - 85778 reads in 7278 unique sequences.
## Sample 58 - 66805 reads in 6951 unique sequences.
## Sample 59 - 108500 reads in 9191 unique sequences.
## Sample 60 - 137050 reads in 20611 unique sequences.
## Sample 61 - 156209 reads in 19523 unique sequences.
## Sample 62 - 133765 reads in 16744 unique sequences.
## Sample 63 - 121069 reads in 9394 unique sequences.
## Sample 64 - 86818 reads in 7517 unique sequences.
## Sample 65 - 52691 reads in 4907 unique sequences.
dadaRs <- dada(filtRs, err = errR, pool = FALSE, multithread = TRUE)
## Sample 1 - 91434 reads in 23989 unique sequences.
## Sample 2 - 111395 reads in 25534 unique sequences.
## Sample 3 - 127471 reads in 23524 unique sequences.
## Sample 4 - 89004 reads in 24714 unique sequences.
## Sample 5 - 99848 reads in 25188 unique sequences.
## Sample 6 - 126400 reads in 31109 unique sequences.
## Sample 7 - 93081 reads in 25472 unique sequences.
## Sample 8 - 104057 reads in 23335 unique sequences.
## Sample 9 - 110017 reads in 25336 unique sequences.
## Sample 10 - 145714 reads in 32859 unique sequences.
## Sample 11 - 121015 reads in 25244 unique sequences.
## Sample 12 - 98960 reads in 29789 unique sequences.
## Sample 13 - 114758 reads in 28585 unique sequences.
## Sample 14 - 160717 reads in 36960 unique sequences.
## Sample 15 - 99795 reads in 24801 unique sequences.
## Sample 16 - 126047 reads in 28281 unique sequences.
## Sample 17 - 86725 reads in 22570 unique sequences.
## Sample 18 - 140566 reads in 32636 unique sequences.
## Sample 19 - 134429 reads in 29726 unique sequences.
## Sample 20 - 91257 reads in 24439 unique sequences.
## Sample 21 - 161003 reads in 38012 unique sequences.
## Sample 22 - 167202 reads in 39088 unique sequences.
## Sample 23 - 152777 reads in 38542 unique sequences.
## Sample 24 - 118126 reads in 28318 unique sequences.
## Sample 25 - 116236 reads in 28507 unique sequences.
## Sample 26 - 188716 reads in 41129 unique sequences.
## Sample 27 - 82010 reads in 18172 unique sequences.
## Sample 28 - 126856 reads in 38250 unique sequences.
## Sample 29 - 88170 reads in 25204 unique sequences.
## Sample 30 - 138640 reads in 32406 unique sequences.
## Sample 31 - 53075 reads in 21363 unique sequences.
## Sample 32 - 96249 reads in 21215 unique sequences.
## Sample 33 - 121832 reads in 28900 unique sequences.
## Sample 34 - 120237 reads in 22907 unique sequences.
## Sample 35 - 101884 reads in 16958 unique sequences.
## Sample 36 - 149692 reads in 38875 unique sequences.
## Sample 37 - 127487 reads in 32709 unique sequences.
## Sample 38 - 127688 reads in 24975 unique sequences.
## Sample 39 - 79140 reads in 21731 unique sequences.
## Sample 40 - 94017 reads in 20659 unique sequences.
## Sample 41 - 133050 reads in 36374 unique sequences.
## Sample 42 - 177248 reads in 37350 unique sequences.
## Sample 43 - 239233 reads in 48038 unique sequences.
## Sample 44 - 84176 reads in 25194 unique sequences.
## Sample 45 - 158908 reads in 35797 unique sequences.
## Sample 46 - 88345 reads in 21065 unique sequences.
## Sample 47 - 83752 reads in 22898 unique sequences.
## Sample 48 - 96472 reads in 11900 unique sequences.
## Sample 49 - 112238 reads in 14803 unique sequences.
## Sample 50 - 133458 reads in 15357 unique sequences.
## Sample 51 - 130657 reads in 14840 unique sequences.
## Sample 52 - 79466 reads in 13345 unique sequences.
## Sample 53 - 155678 reads in 17633 unique sequences.
## Sample 54 - 164385 reads in 19296 unique sequences.
## Sample 55 - 98153 reads in 12862 unique sequences.
## Sample 56 - 125481 reads in 13212 unique sequences.
## Sample 57 - 85778 reads in 13079 unique sequences.
## Sample 58 - 66805 reads in 10062 unique sequences.
## Sample 59 - 108500 reads in 14740 unique sequences.
## Sample 60 - 137050 reads in 33586 unique sequences.
## Sample 61 - 156209 reads in 27179 unique sequences.
## Sample 62 - 133765 reads in 23726 unique sequences.
## Sample 63 - 121069 reads in 16034 unique sequences.
## Sample 64 - 86818 reads in 10645 unique sequences.
## Sample 65 - 52691 reads in 7732 unique sequences.

Inspect the dada-class object

dadaFs[[1]]
## dada-class: object describing DADA2 denoising results
## 600 sequence variants were inferred from 19535 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

6 Merge paired reads

We now merge the forward and reverse reads together to obtain the full denoised sequences. Merging is performed by aligning the denoised forward reads with the reverse-complement of the corresponding denoised reverse reads, and then constructing the merged “contig” sequences. By default, merged sequences are only output if the forward and reverse reads overlap by at least 12 bases, and are identical to each other in the overlap region (but these conditions can be changed via function arguments).

mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose = TRUE)

# Inspect the merger data.frame from the first sample
head(mergers[[1]])
##                                                                                                                                                                                                                                                                                                   sequence
## 1 CCTAATACATGCAAGTCGAGCGCTGGAAGCTAATTTCATCCCTTCGGGATGAAATTAGTGGAAAGAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTACCTATAAGACTGGGATAACTCGCGGAAACGTGAGCTAATACCGGATAATACTTTTTATTGCATAATAAGGAGTTTGAAAGGCGGCGTAAGCTGTCACTTATAGATGGGCCCGCGGCGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGG
## 2 CCTAATACATGCAAGTCGAGCGCTGGAAGCTAATTTCATCCCTTCGGGATGAAATTAGTGGAAAGAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTACCTATAAGACTGGGATAACTCGCGGAAACGTGAGCTAATACCGGATAATACTTTTTATTGCATAATAAGAAGTTTGAAAGGCGGCGTAAGCTGTCACTTATAGATGGGCCCGCGGCGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGG
## 3     CCTAATACATGCAAGTCGAACGCTGCTTTTTCCACCGAAGCTTGCTTCACCGGAAAAAGCGGAGTGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCCTTCAGCGGGGGATAACACTTGGAAACAGGTGCTAATACCGCATAGGATTTCTGTTCGCATGAACGGAGAAGGAAAGACGGCGTAAGCTGTCACTGAAGGATGGACCCGCGGTGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAATGATGCATAGCCGACCTGAGAGGGTAATCGGCCACACTGGG
## 4   CCTAATACATGCAAGTCGAGCGAGTGAAATTAACTGATCCCTTCGGGGTGACGTTAATGGATCTAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTGCCTGTAAGACTGGGATAACTCGCGGAAACGTGAGCTAATACCGGATAACACTTCGAACCACATGGTTTGAAGATGAAAGGCGGCGCAAGCTGTCACTTACAGATGGGCCCGCGGCGCATTAGCTAGTTGGTGAGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGG
## 5                 CCTAATACATGCAAGTCGAGCGGACGGATGGGAGCTTGCTCCCAGAAGTTAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTGCCCTATAGTTGGGGATAACTCCGGGAAACCGGGGCTAATACCGAATGATATAATTTAGCTCCTGCTAGATTGTTGAAAGATGGTTTACGCTATCGCTATAGGATGGGCCCGCGGCGCATTAGCTAGTTGGTGAGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGG
## 6 CCTAATACATGCAAGTCGAGCGCGGGAATCTCTTCTGATCCCTTCGGGGTGAAGAGAGTGGAACGAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTACCTATAAGACTGGGATAACTCGCGGAAACGTGAGCTAATACCGGATAATACTTTTTATTGCATAGTAAGAAGTTTGAAAGGCGGCGTAAGCTGTCACTTATAGATGGACCCGCGGCGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGG
##   abundance forward reverse nmatch nmismatch nindel prefer accept
## 1      5426       1       2    114         0      0      1   TRUE
## 2      4221       2       1    114         0      0      2   TRUE
## 3      3590       3       5    118         0      0      1   TRUE
## 4      3487       4       3    116         0      0      2   TRUE
## 5      2791       5       6    130         0      0      1   TRUE
## 6      2661       6       7    114         0      0      1   TRUE

Considerations: Most of reads should successfully merge. If that is not the case, upstream parameters may need to be revisited: Did we trim away the overlap between the reads?

7 Construct sequence table

We can now construct an amplicon sequence variant table (ASV) table, a higher-resolution version of the OTU table produced by traditional methods.

seqtab <- makeSequenceTable(mergers)

The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants.

dim(seqtab)
## [1]    65 10435

Inspect distribution of sequence lengths

table(nchar(getSequences(seqtab))) %>% 
  as.data.frame() %>% 
  rename("seqence length" = Var1) %>% 
  datatable(options = list(
    columnDefs = list(list(className = 'dt-left', targets = c(0:2)))
    ))

Plot sequence length distribution

seqLen <- nchar(getSequences(seqtab)) %>% 
  as.data.frame() %>% 
  rename(seqLen = ".") 

ggplot(seqLen, aes(x = seqLen)) + 
  geom_histogram(binwidth = 1, alpha = 0.2, position = "identity", colour = "red") +
  geom_vline(aes(xintercept = mean(seqLen)), color = "blue", linetype = "dashed", size = 0.5) +
  scale_x_continuous(breaks = seq(round_any(min(seqLen), 10), 
                                round_any(max(seqLen), 10, f = ceiling), 
                                round_any((max(seqLen)-min(seqLen))/10, 10, f = ceiling))) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)), 
                     breaks = seq(0, 
                                round_any(max(table(seqLen)), 10, f = ceiling), 
                                round_any(max(table(seqLen))/10, 10, f = ceiling))) +
  labs(x = "sequence length (bp)", title = "Amplicon length distribution") +
  annotate("text", label = "mean length", x = mean(seqLen$seqLen)-2, y = max(table(seqLen)), hjust = 1, colour = "blue") +
  theme_bw() 

Considerations: Sequences that are much longer or shorter than expected may be the result of non-specific priming. The sequence lengths fall within the range of the expected amplicon sizes, most of which fall between 240 and 311 with a long tail on the right. We’ll just leave them as they are.

8 Remove chimeras

The core dada method corrects substitution and indel errors, but chimeras remain. Fortunately, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences. As we used pool = TRUE during the sample inference, we should use method = "pooled" for the chimera removal.

seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE, verbose = TRUE)
sum(seqtab.nochim)/sum(seqtab)
## [1] 0.9708676
dim(seqtab.nochim)
## [1]   65 6058

The frequency of chimeric sequences varies substantially from dataset to dataset, and depends on on factors including experimental procedures and sample complexity.

Considerations: Most of the reads should remain after chimera removal (it is not uncommon for a majority of sequence variants to be removed though). If most of the reads were removed as chimeric, upstream processing may need to be revisited. In almost all cases this is caused by primer sequences with ambiguous nucleotides that were not removed prior to beginning the DADA2 pipeline.

9 Track reads through the pipeline

As a final check of our progress, we’ll look at the number of reads that made it through each step in the pipeline:

getN <- function(x) sum(getUniques(x))
stats <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(stats) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(stats) <- sample.names
datatable(stats)

Plot the sequences stats:

p_stats <- stats %>% 
  as.data.frame() %>%
  rownames_to_column("SampleID") %>%
  mutate_at(vars("filtered":"nonchim"), ~100*.x/input) %>% 
  mutate(input = 100) %>%
  gather(key = "step", value = "percent", -SampleID) %>%
  mutate(step = factor(step, levels = c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim"))) %>%
  ggplot(aes(x = step, y = percent, color = SampleID)) +
    geom_point() +
    geom_line(aes(group = SampleID)) +
    scale_y_continuous(breaks = 0:10*10) +
    labs(x = "", y = "Reads retained (%)") +
    theme_bw()

ggplotly(p_stats, tooltip = c("x", "y", "colour"))

Looks good! Except for the filtering step, we kept the majority of our raw reads.

Considerations: This is a great place to do a last sanity check. Outside of filtering, there should be no step in which a majority of reads are lost. If a majority of reads failed to merge, one may need to revisit the truncLen parameter used in the filtering step and make sure that the truncated reads span the amplicon. If a majority of reads were removed as chimeric, one may need to revisit the removal of primers, as the ambiguous nucleotides in unremoved primers interfere with chimera identification.

10 Export data

Export feature table:

t(seqtab.nochim) %>%
  make_biom() %>%
  write_biom(here::here("data/intermediate/dada2/table-run1.biom"))

Export representative sequences:

uniquesToFasta(seqtab.nochim, fout = here::here("data/intermediate/dada2/rep-seqs-run1.fna"), 
               ids = colnames(seqtab.nochim))

11 Acknowledgements

The processing of raw sequence data into an ASV table is based on the online DADA2 tutorial (1.12). For more documentations and tutorials, visit the DADA2 website.

12 Session information

sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=nb_NO.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=nb_NO.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=nb_NO.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=nb_NO.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] biomformat_1.18.0 plotly_4.9.3      DT_0.17           plyr_1.8.6       
##  [5] dada2_1.18.0      Rcpp_1.0.6        forcats_0.5.1     stringr_1.4.0    
##  [9] dplyr_1.0.5       purrr_0.3.4       readr_1.4.0       tidyr_1.1.3      
## [13] tibble_3.1.0      ggplot2_3.3.3     tidyverse_1.3.0   here_1.0.1       
## [17] knitr_1.31       
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_2.0-0            hwriter_1.3.2              
##   [3] ellipsis_0.3.1              rprojroot_2.0.2            
##   [5] htmlTable_2.1.0             XVector_0.30.0             
##   [7] GenomicRanges_1.42.0        base64enc_0.1-3            
##   [9] fs_1.5.0                    rstudioapi_0.13            
##  [11] farver_2.1.0                fansi_0.4.2                
##  [13] lubridate_1.7.10            xml2_1.3.2                 
##  [15] codetools_0.2-18            splines_4.0.5              
##  [17] Formula_1.2-4               jsonlite_1.7.2             
##  [19] Rsamtools_2.6.0             broom_0.7.5                
##  [21] cluster_2.1.1               dbplyr_2.1.0               
##  [23] png_0.1-7                   compiler_4.0.5             
##  [25] httr_1.4.2                  backports_1.2.1            
##  [27] lazyeval_0.2.2              assertthat_0.2.1           
##  [29] Matrix_1.3-2                cli_2.3.1                  
##  [31] htmltools_0.5.1.1           tools_4.0.5                
##  [33] gtable_0.3.0                glue_1.4.2                 
##  [35] GenomeInfoDbData_1.2.4      reshape2_1.4.4             
##  [37] ShortRead_1.48.0            Biobase_2.50.0             
##  [39] cellranger_1.1.0            jquerylib_0.1.3            
##  [41] rhdf5filters_1.2.0          vctrs_0.3.6                
##  [43] Biostrings_2.58.0           debugme_1.1.0              
##  [45] crosstalk_1.1.1             xfun_0.22                  
##  [47] ps_1.6.0                    rvest_1.0.0                
##  [49] lifecycle_1.0.0             zlibbioc_1.36.0            
##  [51] scales_1.1.1                hms_1.0.0                  
##  [53] MatrixGenerics_1.2.1        parallel_4.0.5             
##  [55] SummarizedExperiment_1.20.0 rhdf5_2.34.0               
##  [57] RColorBrewer_1.1-2          yaml_2.2.1                 
##  [59] gridExtra_2.3               sass_0.3.1                 
##  [61] rpart_4.1-15                latticeExtra_0.6-29        
##  [63] stringi_1.5.3               highr_0.8                  
##  [65] S4Vectors_0.28.1            checkmate_2.0.0            
##  [67] BiocGenerics_0.36.0         BiocParallel_1.24.1        
##  [69] GenomeInfoDb_1.26.4         rlang_0.4.10               
##  [71] pkgconfig_2.0.3             bitops_1.0-6               
##  [73] matrixStats_0.58.0          evaluate_0.14              
##  [75] lattice_0.20-41             Rhdf5lib_1.12.1            
##  [77] labeling_0.4.2              GenomicAlignments_1.26.0   
##  [79] htmlwidgets_1.5.3           tidyselect_1.1.0           
##  [81] magrittr_2.0.1              R6_2.5.0                   
##  [83] IRanges_2.24.1              generics_0.1.0             
##  [85] Hmisc_4.5-0                 DelayedArray_0.16.2        
##  [87] DBI_1.1.1                   pillar_1.5.1               
##  [89] haven_2.3.1                 foreign_0.8-81             
##  [91] withr_2.4.1                 survival_3.2-10            
##  [93] RCurl_1.98-1.3              nnet_7.3-15                
##  [95] modelr_0.1.8                crayon_1.4.1               
##  [97] utf8_1.2.1                  rmarkdown_2.7              
##  [99] jpeg_0.1-8.1                grid_4.0.5                 
## [101] readxl_1.3.1                data.table_1.14.0          
## [103] reprex_1.0.0                digest_0.6.27              
## [105] RcppParallel_5.0.3          stats4_4.0.5               
## [107] munsell_0.5.0               viridisLite_0.3.0          
## [109] bslib_0.2.4